Automatic font generation without human experts is a practical and significant problem, especially for languages that consist of a large number of characters. Existing font generation methods are mostly based on supervised learning. They require a large amount of paired data, which is labor-intensive and expensive to collect. Meanwhile, common unsupervised image-to-image translation methods are not applicable to font generation, because they typically define style as a set of textures and colors. In this work, we propose a robust deformable generative network for unsupervised font generation (abbreviated as DGFont++). We introduce a feature deformation skip connection (FDSC) to learn local patterns and geometric transformations between fonts. The FDSC predicts pairs of displacement maps and uses the predicted maps to apply deformable convolution to the low-level content feature maps. The outputs of the FDSC are fed into a mixer to generate the final results. Moreover, we introduce contrastive self-supervised learning to learn a robust style representation for fonts by understanding the similarities and dissimilarities between fonts. To distinguish different styles, we train our model with a multi-task discriminator, which ensures that each style can be discriminated independently. In addition to the adversarial loss, two reconstruction losses are adopted to constrain the domain-invariant characteristics between generated images and content images. Taking advantage of the FDSC and the adopted loss functions, our model is able to maintain spatial information and generate high-quality character images in an unsupervised manner. Experiments demonstrate that our model generates character images of higher quality than state-of-the-art methods.
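The displacement-map idea can be illustrated with a minimal sketch, assuming a single-channel feature map stored as nested lists and one hand-written (dy, dx) offset per position. The names `bilinear_sample` and `deform` are hypothetical. This shows only the offset-based resampling step that deformable convolution builds on, not the FDSC module itself, which predicts the offsets with a network and applies learned convolution kernels.

```python
import math


def bilinear_sample(feature, y, x):
    """Sample a 2D feature map at a fractional (y, x) position.

    Cells outside the map contribute zero (zero padding).
    """
    h, w = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    # Visit the four integer neighbours and weight each by its overlap.
    for yy, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yy < h and 0 <= xx < w:
                value += wy * wx * feature[yy][xx]
    return value


def deform(feature, offsets):
    """Resample every position of `feature` at its own displaced location.

    offsets[i][j] is a (dy, dx) pair standing in for a predicted
    displacement map (hand-written here, an assumption of this sketch).
    """
    h, w = len(feature), len(feature[0])
    return [
        [
            bilinear_sample(feature, i + offsets[i][j][0], j + offsets[i][j][1])
            for j in range(w)
        ]
        for i in range(h)
    ]
```

With all offsets set to `(0, 0)` the map is returned unchanged; non-zero offsets shift where each output position reads from, which is how a stroke can be moved or bent without changing the content features themselves.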
We present a data-driven framework to automate the vectorization and machine interpretation of 2D engineering part drawings. In industrial settings, most manufacturing engineers still rely on manual reads to identify the topological and manufacturing requirements from drawings submitted by designers. The interpretation process is laborious and time-consuming, which severely inhibits the efficiency of part quotation and manufacturing tasks. While recent advances in image-based computer vision methods have demonstrated great potential in interpreting natural images through semantic segmentation approaches, the application of such methods to parsing engineering technical drawings into semantically accurate components remains a significant challenge. The severe pixel sparsity in engineering drawings also restricts the effective featurization of image-based data-driven methods. To overcome these challenges, we propose a deep-learning-based framework that predicts the semantic type of each vectorized component. Taking a raster image as input, we vectorize all components through thinning, stroke tracing, and cubic Bézier fitting. A graph of these components is then generated based on the connectivity between them. Finally, a graph convolutional neural network is trained on this graph data to identify the semantic type of each component. We test our framework in the context of semantic segmentation of text, dimension, and contour components in engineering drawings. Results show that our method yields the best performance compared to recent image-based and graph-based segmentation methods.
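The graph-construction step can be sketched as follows, assuming each vectorized component is summarized by its endpoints and that two components are connected when any pair of their endpoints lies within a tolerance. The names `build_component_graph` and `mean_aggregate`, the endpoint representation, and the tolerance rule are all assumptions of this sketch, not details taken from the abstract. The second function shows only the neighbourhood-averaging core of a graph convolution, without learned weights.

```python
from itertools import combinations


def build_component_graph(components, tol=1.0):
    """Connect components whose endpoints lie within `tol` of each other.

    `components` maps a component id to a list of (x, y) endpoints.
    Returns an adjacency dict: id -> set of neighbouring ids.
    """
    graph = {cid: set() for cid in components}
    for a, b in combinations(components, 2):
        touching = any(
            (pa[0] - pb[0]) ** 2 + (pa[1] - pb[1]) ** 2 <= tol ** 2
            for pa in components[a]
            for pb in components[b]
        )
        if touching:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def mean_aggregate(graph, features):
    """Average each node's feature vector with those of its neighbours.

    This is one round of message passing; a trained graph convolution
    would additionally apply a learned linear map and a nonlinearity.
    """
    out = {}
    for node, neighbours in graph.items():
        group = [features[node]] + [features[n] for n in sorted(neighbours)]
        out[node] = [sum(col) / len(group) for col in zip(*group)]
    return out
```

For example, two strokes that share a corner point become neighbours and exchange features, while an isolated stroke keeps its own feature vector unchanged.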
Over-parameterization of deep neural networks (DNNs) has shown high prediction accuracy for many applications. Although effective, the large number of parameters hinders the adoption of such models on resource-limited devices and has an outsize environmental impact. Sparse training (using a fixed number of nonzero weights in each iteration) could significantly mitigate the training costs by reducing the model size. However, existing sparse training methods mainly use either random-based or greedy-based drop-and-grow strategies, resulting in local minima and low accuracy. In this work, to assist explainable sparse training, we propose important weights Exploitation and coverage Exploration to characterize Dynamic Sparse Training (DST-EE), and provide a quantitative analysis of these two metrics. We further design an acquisition function, provide theoretical guarantees for the proposed method, and clarify its convergence property. Experimental results show that sparse models (up to 98\% sparsity) obtained by our proposed method outperform the SOTA sparse training methods on a wide variety of deep learning tasks. On VGG-19 / CIFAR-100, ResNet-50 / CIFAR-10, and ResNet-50 / CIFAR-100, our method achieves even higher accuracy than dense models. On ResNet-50 / ImageNet, the proposed method achieves up to 8.2\% accuracy improvement compared to SOTA sparse training methods.
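A drop-and-grow update that balances exploitation and exploration can be sketched as below. The abstract does not give the acquisition function, so the score used here, gradient magnitude plus a UCB-style bonus `c * ln(step) / (count + eps)`, is an assumption of this sketch, as are the function name `drop_and_grow` and all of its parameters. Growth candidates are restricted to positions that were inactive before the update, which is one possible convention among several.

```python
import math


def drop_and_grow(weights, grads, mask, counts, step, k, c=1.0, eps=1.0):
    """Return a new mask with the same number of active weights.

    weights, grads: per-position values (flat lists).
    mask:   list of bools, True where a weight is currently active.
    counts: how many past iterations each position has been active.
    step:   current iteration number (must be >= 1 for the log term).
    k:      number of weights to drop and to regrow.
    """
    active = [i for i, m in enumerate(mask) if m]
    inactive = [i for i, m in enumerate(mask) if not m]
    k = min(k, len(active), len(inactive))

    # Drop: remove the k active weights with the smallest magnitude.
    dropped = sorted(active, key=lambda i: abs(weights[i]))[:k]

    def score(i):
        exploit = abs(grads[i])  # favour positions with large gradients
        explore = c * math.log(step) / (counts[i] + eps)  # favour rarely used ones
        return exploit + explore

    # Grow: activate the k previously inactive positions with the best score.
    grown = sorted(inactive, key=score, reverse=True)[:k]

    new_mask = list(mask)
    for i in dropped:
        new_mask[i] = False
    for i in grown:
        new_mask[i] = True
    return new_mask
```

Setting `c=0` reduces the rule to a purely greedy, gradient-based grow step; a larger `c` pushes the mask toward positions that have rarely been active, which is the coverage side of the trade-off.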
Pre-trained language models (PLMs) are known to improve the generalization performance of natural language understanding models by leveraging large amounts of data during the pre-training phase. However, the out-of-distribution (OOD) generalization problem remains a challenge in many NLP tasks, limiting the real-world deployment of these methods. This paper presents the first attempt at creating a unified benchmark named GLUE-X for evaluating OOD robustness in NLP models, highlighting the importance of OOD robustness and providing insights on how to measure the robustness of a model and how to improve it. The benchmark includes 13 publicly available datasets for OOD testing, and evaluations are conducted on 8 classic NLP tasks over 19 popularly used PLMs. Our findings confirm the need for improved OOD accuracy in NLP tasks, as significant performance degradation was observed in all settings compared to in-distribution (ID) accuracy.
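The ID-versus-OOD comparison reported above amounts to evaluating the same model on two test sets and measuring the gap. A minimal sketch, assuming predictions and labels are plain lists; the function names `accuracy` and `ood_degradation` are hypothetical and are not part of GLUE-X:

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal in length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def ood_degradation(id_preds, id_labels, ood_preds, ood_labels):
    """Compare in-distribution and out-of-distribution accuracy for one model."""
    id_acc = accuracy(id_preds, id_labels)
    ood_acc = accuracy(ood_preds, ood_labels)
    return {"id_acc": id_acc, "ood_acc": ood_acc, "drop": id_acc - ood_acc}
```

A positive `drop` means the model loses accuracy when moved from the ID test set to the OOD test set for the same task.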
We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct the masked-out image-text aligned vision features conditioned on visible image patches. Via this pretext task, we can efficiently scale up EVA to one billion parameters and set new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation, and semantic segmentation, without heavy supervised training. Moreover, we observe that quantitative changes in scaling EVA result in qualitative changes in transfer learning performance that are not present in other models. For instance, EVA takes a great leap in the challenging large-vocabulary instance segmentation task: our model achieves almost the same state-of-the-art performance on the LVISv1.0 dataset, with over a thousand categories, as on the COCO dataset, with only eighty categories. Beyond a pure vision encoder, EVA can also serve as a vision-centric, multi-modal pivot to connect images and text. We find that initializing the vision tower of a giant CLIP from EVA can greatly stabilize training and outperform the trained-from-scratch counterpart with far fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models. To facilitate future research, we release all the code and models at https://github.com/baaivision/EVA.
Data-efficient learning on graphs (GEL) is essential in real-world applications. Existing GEL methods focus on learning useful representations for nodes, edges, or entire graphs with ``small'' labeled data. However, the problem of data-efficient learning for subgraph prediction has not been explored. The challenges of this problem lie in the following aspects: 1) It is crucial for subgraphs to learn positional features in order to acquire structural information from the base graph in which they exist. Although the existing subgraph neural network method is capable of learning disentangled position encodings, its overall computational complexity is very high. 2) Prevailing graph augmentation methods for GEL, including rule-based, sample-based, adaptive, and automated methods, are not suitable for augmenting subgraphs, because a subgraph contains fewer nodes but richer information such as position, neighbor, and structure. Subgraph augmentation is therefore more susceptible to undesirable perturbations. 3) Only a small number of nodes in the base graph are contained in subgraphs, which leads to a potential ``bias'' problem in which subgraph representation learning is dominated by these ``hot'' nodes. By contrast, the remaining nodes fail to be fully learned, which reduces the generalization ability of subgraph representation learning. In this paper, we aim to address the challenges above and propose a Position-Aware Data-Efficient Learning framework for subgraph neural networks called PADEL. Specifically, we propose a novel anchor-free node position encoding method, design a new generative subgraph augmentation method based on a diffused variational subgraph autoencoder, and propose exploratory and exploitable views for subgraph contrastive learning. Extensive experimental results on three real-world datasets show the superiority of our proposed method over state-of-the-art baselines.
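We introduce the first learning-based reconstructability predictor to improve view and path planning for large-scale 3D urban scene acquisition using drones. In contrast to previous heuristic approaches, our method learns a model that explicitly predicts how well a 3D urban scene can be reconstructed from a set of viewpoints. To make such a model trainable and at the same time applicable to drone path planning, we simulate proxy-based 3D scene reconstruction during training to set up the prediction. Specifically, the neural network we design is trained to predict the reconstructability of a scene as a function of the proxy geometry, a set of viewpoints, and a series of scene images acquired in flight. To reconstruct a new urban scene, we first build the 3D scene proxy and then rely on the reconstruction quality and uncertainty measures predicted by our network from the proxy geometry to guide the drone path planning. We demonstrate that our data-driven reconstructability predictions correlate more closely with the true reconstruction quality than prior heuristic measures. Furthermore, our learned predictor can be easily integrated into existing path planners to yield improvements. Finally, we devise a new iterative view planning framework based on the learned reconstructability and show the superior performance of the new planner when reconstructing both synthetic and real scenes.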
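Motivations, emotions, and actions are inter-related essential factors in human activities. Although motivations and emotions have long been considered central to exploring how people take actions in human activities, there has been relatively little research analyzing the relationship between human mental states and actions. We present the first study that investigates the viability of modeling motivations, emotions, and actions in language-based human activities, named COMMA (Cognitive Framework of Human Activities). Guided by COMMA, we define three natural language processing tasks (emotion understanding, motivation understanding, and conditioned action generation) and build a challenging dataset, Hail, by automatically extracting samples from Story Commonsense. Experimental results on NLP applications demonstrate the effectiveness of modeling this relationship. Furthermore, compared with existing methods, models inspired by COMMA can better reveal the essential relationship among motivations, emotions, and actions.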
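Semi-supervised learning (SSL) improves model generalization by leveraging massive unlabeled data to augment limited labeled samples. However, currently popular SSL evaluation protocols are often constrained to computer vision (CV) tasks. In addition, previous work typically trains deep neural networks from scratch, which is time-consuming and environmentally unfriendly. To address these issues, we construct a Unified SSL Benchmark (USB) by selecting 15 diverse, challenging, and comprehensive tasks from CV, natural language processing (NLP), and audio processing (Audio). On these tasks we systematically evaluate the dominant SSL methods, and we open-source a modular and extensible codebase for fair evaluation of these SSL methods. We further provide pre-trained versions of state-of-the-art neural models for CV tasks to make further tuning affordable. USB enables the evaluation of a single SSL algorithm on more tasks from multiple domains at a lower cost. Specifically, on a single NVIDIA V100, only 37 GPU days are required to evaluate FixMatch on the 15 tasks in USB, whereas 335 GPU days (279 GPU days on the 4 CV datasets other than ImageNet) are needed on 5 CV tasks with the typical protocol.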
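The cost comparison in the last sentence can be worked through explicitly. The three GPU-day figures below are the ones quoted in the abstract; the function name `cost_summary`, the derived ratios, and the ImageNet remainder (obtained by subtraction) are illustrative arithmetic of this sketch, not numbers reported by the authors.

```python
def cost_summary(usb_days=37, typical_days=335, typical_without_imagenet=279):
    """Derive relative costs from the GPU-day figures quoted in the abstract."""
    return {
        # How many times cheaper USB (15 tasks) is than the typical protocol (5 CV tasks).
        "overall_ratio": typical_days / usb_days,
        # The same ratio when ImageNet is left out of the typical protocol.
        "ratio_without_imagenet": typical_without_imagenet / usb_days,
        # GPU days attributable to ImageNet alone, by subtraction.
        "imagenet_days": typical_days - typical_without_imagenet,
    }
```

Under these figures, evaluating FixMatch on all 15 USB tasks costs roughly one ninth of the typical 5-task CV protocol, and still about 7.5 times less than the typical protocol with ImageNet excluded.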
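Drones equipped with cameras can significantly enhance the human ability to perceive the world because of their remarkable maneuverability in 3D space. Ironically, object detection for drones has always been conducted in the 2D image space, which fundamentally limits their ability to understand 3D scenes. Furthermore, existing 3D object detection methods developed for autonomous driving cannot be directly applied to drones because they lack deformation modeling, which is essential for the distant aerial perspective with its sensitive distortion and small objects. To fill this gap, this work proposes a dual-view detection system named DVDET to achieve aerial monocular object detection in both the 2D image space and the 3D physical space. To address the severe view deformation issue, we propose a trainable transformation module that can properly warp information from the drone's perspective to the BEV. Compared to monocular methods for cars, our transformation includes a learnable deformable network that explicitly corrects the severe deviation. To address the dataset challenge, we propose a new large-scale simulation dataset named AM3D-SIM, generated by the co-simulation of AirSim and CARLA, and a new real-world aerial dataset named AM3D-REAL, collected with a DJI Matrice 300 RTK. Both datasets provide high-quality annotations for 3D object detection. Extensive experiments show that i) aerial monocular 3D object detection is feasible; ii) a model pre-trained on the simulation dataset benefits real-world performance; and iii) DVDET also benefits monocular 3D object detection for cars. To encourage more researchers to investigate this area, we will release the datasets and related code at https://sjtu-magic.github.io/dataset/am3d/.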